INFERENCE SEQUENCING BY HYBRIDIZATION

This application for patent under 35 U.S.C. § 111(a) claims priority to 
Provisional Application Serial No. 60/063,103, filed October 24, 1997 under 35 U.S.C. 
§ 111(b). This invention was made with Government Support under Contract Number 
DGE-9452651 awarded by the National Science Foundation. The Government has 
certain rights in the invention. 

BACKGROUND OF THE INVENTION 

Procedures involving the use of sequencing by hybridization (SBH) are known to
those skilled in the art, and have recently been demonstrated to be useful as a powerful
alternative to electrophoretic methods for diagnostic DNA analysis [M. Chee et al.,
Accessing Genetic Information With High Density DNA Arrays, Science, 274, 610
(1996); J. Hacia et al., Detection Of Heterozygous Mutations In BRCA1 Using
High-Density Oligonucleotide Arrays And Two-Color Fluorescence Analysis, Nature
Genetics, 14, 441 (1996)]. Diagnostic SBH employs hybridization of a target DNA
sequence to a tiled array of several thousand short oligonucleotide probes of known
sequence [W. Bains et al., A Novel Method For Nucleic Acid Sequence
Determination, J Theor Biol, 135, 303 (1988); E. Southern et al., Hybridization With
Oligonucleotide Arrays, Genomics, 13, 1008 (1992)]. The pattern of hybridization,
detected by fluorescence microscopy, indicates which oligonucleotides in the probe 
array are present in the target DNA. When this information is compared against a 
reference target sequence, the entire sequence of the target DNA can be reconstructed 
at high accuracy. 

SBH, while offering great advantages in terms of throughput for diagnostic 
sequence analysis, suffers from the drawback that a different probe array must be 
tailored for each target DNA analyzed [M. Chee et al., Accessing Genetic Information
With High Density DNA Arrays, Science, 274, 610 (1996)]. For de novo sequencing,
current SBH methods are not competitive with electrophoretic sequencing techniques
that yield 600-1000 base pair read lengths per experiment [P. Pevzner et al., Improved
Chips For Sequencing By Hybridization, J Biomolecular Struct Dyn, 9, 399 (1991)].
Even under perfect experimental conditions, existing SBH designs cannot reconstruct a
unique target sequence from hybridization data alone [P. Pevzner et al., Towards DNA
Sequencing Chips, In 19th Int Conf Mathematical Foundations of Computer Science,
Lecture Notes in Computer Science, Springer-Verlag, Berlin, Vol 841, pp. 143-158
(1994); P. Pevzner, Rearrangements Of DNA Sequences And SBH, Computers Chem,
18, 221 (1994)]. Without a reference sequence for comparison, de novo SBH is 
fundamentally limited because it acquires base sequence information at the cost of 
positional information. One knows exactly which subsequences (probe sequences) are 
present in the target DNA but not where they are located. 

Subsequences must be arranged by examining how they overlap with one 
another. For example, octa-nucleotides are assembled into longer sequences by finding 
corresponding seven-base overlaps. The accuracy of reassembly is limited because any 
particular subsequence can occur more than once in the target DNA, leading to an 
ambiguity in the final reconstructed sequence [P. Pevzner et al., Towards DNA
Sequencing Chips, In 19th Int Conf Mathematical Foundations of Computer Science,
Lecture Notes in Computer Science, Springer-Verlag, Berlin, Vol 841, pp. 143-158
(1994); P. Pevzner, Rearrangements Of DNA Sequences And SBH, Computers Chem,
18, 221 (1994)]. De novo SBH designs typically use the complete set of all possible
oligonucleotide probes of a given length [W. Bains et al., A Novel Method For
Nucleic Acid Sequence Determination, J Theor Biol, 135, 303 (1988); N. Broude et
al., Enhanced DNA Sequencing By Hybridization, Proc Natl Acad Sci USA, 91, 3072
(1994); R. Drmanac et al., DNA Sequence Determination by Hybridization: A
Strategy For Efficient Large-Scale Sequencing, Science, 260, 1649 (1993); R. Drmanac
et al., Sequencing Of Megabase Plus DNA By Hybridization: Theory Of The Method,
Genomics, 4, 114 (1989)]; the use of longer probes increases reconstruction accuracy
but requires the use of very large arrays (> 10^9 probes), since the number of required
probes increases exponentially with probe length. Even assuming perfect hybridization,
an SBH array containing all ~10^6 possible 10-mers would reliably be able to sequence
only about 600 bp of target DNA in a single experiment. As longer target DNA
sequences are attempted, the reconstruction accuracy drops precipitously. For a 
detailed discussion of SBH reassembly algorithms and their limitations, see
[P. Pevzner et al., Towards DNA Sequencing Chips, In 19th Int Conf Mathematical
Foundations of Computer Science, Lecture Notes in Computer Science, Springer-Verlag,
Berlin, Vol 841, pp. 143-158 (1994); P. Pevzner et al., Improved Chips For
Sequencing By Hybridization, J Biomolecular Struct Dyn, 9, 399 (1991); P. Pevzner,
Rearrangements Of DNA Sequences And SBH, Computers Chem, 18, 221 (1994)].

All de novo SBH strategies proposed thus far are direct methods, that is, they 
directly probe the target DNA with oligonucleotides whose sequences are then 
assembled into longer fragments. Since it is currently not feasible to manufacture a 
probe array containing more than approximately 10^6 tiled oligonucleotide probes, a
direct de novo SBH approach cannot outperform electrophoretic sequencing in terms of
read length and accuracy.

IN THE FIGURES 

The present invention will become better understood with reference to the 
following description, appended claims, and accompanying figures where: 

Figure 1. Basic concepts underlying the design of an Inference Sequencing by
Hybridization (ISBH) probe array. A target, in this case a four-digit phone number, is
characterized by a degenerate probe array with a four-fold redundancy using only 64
probes. The probe array does not detect any digit directly, but the information
gathered is sufficient to unambiguously infer the identity of the number. A probe 
array based on conventional SBH designs capable of acquiring the same information 
would require 10,000 probes, one for each phone number. 

Figure 2. General scheme for ISBH. A long single-stranded target DNA is
sheared into short oligonucleotides and hybridized to an ISBH probe array. The
pattern of hybridization is used to create a set of degenerate 16-mers that characterize
the target DNA. Information from this degenerate set is used by an inference
algorithm to produce a set of explicit 16-mers. The set of explicit 16-mers produced
by the inference algorithm contains all 16-mers actually present in the target DNA 
sequence as well as "false positive" 16-mers that are not in the target. A data 
reduction algorithm is then used to eliminate the false positives from the set of explicit 
16-mers. The explicit 16-mers that remain after data reduction are then reassembled at 
high accuracy into contiguous sequence. 

Figure 3. Design of the ISBH probe array of degenerate 16-mers used in this 
study. The array consists of 25 different probe groups. Each group pattern represents 
2^16 = 65,536 degenerate 16-mers, for a total of 1,638,400 probes. Each probe in the
array represents 65,536 explicit 16-mers. Under ideal conditions, a single target 
16-mer will hybridize to exactly one probe in each probe group. 

Figure 4. Example of how false positives are generated by inference. 
Tetranucleotides from a target DNA are characterized by an ISBH probe array of
degenerate 4-mers with two-fold redundancy (R = A or G; Y = C or T; W = A or T; 
S = C or G). The inference algorithm generates all valid combinations of data from 
the probe array to produce a set of nine inferred tetranucleotides, six of which are false 
positives. In general, the number of false positives generated by inference decreases 
with the number of probes used in the ISBH array. If the ISBH array used in this 
example had 16 additional probes, then no false positives would have been generated. 

Figure 5. Data reduction after inference for the ISBH array of degenerate 
16-mers used in this study. Seventy-six target DNA sequences downloaded from 
GenBank comprising a total of 2.45 million bases were tested by computer simulation. 
The number of 16-mers generated by inference increases as a power law function of 
the number of different 16-mers in the target DNA (filled circles). Data reduction 
reliably eliminated all but a handful of the false positives for all target lengths 
investigated (open triangles), even when false positives comprised more than 99% of 
the inferred 16-mer set. 

Figure 6. Absolute performance of the ISBH sequence reassembly algorithm. 
The reassembly algorithm returns a target DNA as several non-overlapping fragments 
in the range of several hundred to several thousand bases in length. The largest
fragment reconstructed in this study was 28 kilobases, and fragments longer than 10 kb
were commonly observed. Reconstructed fragments always show 100% identity to 
some region of the target DNA. 

Figure 7. Total target coverage in a simulated ISBH experiment. ISBH 
typically covers more than 95% of a target DNA in a single hybridization experiment 
in fragments that are longer than 500 bases. 

Figure 8. Summary of ISBH simulation results. All lengths are in bases. 
Locus and definition for each sequence are shown exactly as they appear in GenBank. 
The number of different 16-mers in a target is defined as the number of 16-mers which
have different base sequences. The fraction of target 16-mers that are repeated is
given by: 1 - [(number of different 16-mers in target)/(target length - 15)], which is a
quantitative measure of repetitiveness of the sequence. The fraction of the inferred set 
that are false positive is given by: 1 - [(number of different 16-mers in 
target)/(number of 16-mers in inferred set)]. 
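
The two fractions defined for Figure 8 can be computed as in the following minimal sketch. The function names `repeat_fraction` and `false_positive_fraction` are illustrative assumptions and do not appear in the original simulation code, which was written in C.

```python
def repeat_fraction(target, k=16):
    """1 - (number of different k-mers in target) / (target length - (k - 1))."""
    total = len(target) - (k - 1)          # number of k-mer positions in the target
    different = len({target[i:i + k] for i in range(total)})
    return 1 - different / total


def false_positive_fraction(n_different_in_target, n_inferred):
    """1 - (different k-mers in target) / (k-mers in inferred set)."""
    return 1 - n_different_in_target / n_inferred


# Toy example with 4-mers: ACGT occurs twice among the five 4-mer positions.
print(repeat_fraction("ACGTACGT", k=4))
print(false_positive_fraction(3, 9))
```

For 16-mers the denominator of the first fraction reduces to (target length - 15), as stated in the Figure 8 description.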

SUMMARY OF THE INVENTION 

The present invention is directed to a method that satisfies the above mentioned 
problems by introducing a new SBH implementation for de novo sequencing, which 
we term Inference Sequencing by Hybridization (ISBH). The basic concepts 
underlying ISBH are illustrated in Figure 1 with an example of how to determine the 
last four digits of a phone number without detecting any digit directly. A conventional 
SBH strategy would require 10^4 = 10,000 probes (one for each phone number),
whereas the ISBH approach gathers the same information using only 4 x 2^4 = 64
degenerate probes. The digit groupings in Figure 1 are analogous to familiar base
groupings such as purine/pyrimidine (R/Y), amino/keto (M/K), and weak/strong (W/S). 
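
The phone-number analogy can be illustrated with a short sketch. The digit groupings drawn in Figure 1 are not reproduced in the text, so the sketch assumes a hypothetical set of four two-way groupings (one per bit of the binary value of the digit). The names `GROUPINGS`, `hybridize`, and `infer` are likewise assumptions made for illustration.

```python
from itertools import product

# Hypothetical two-way digit groupings (an assumption, not taken from Figure 1):
# grouping g splits the digits 0-9 into two classes by bit g of their binary value.
GROUPINGS = [lambda d, g=g: (d >> g) & 1 for g in range(4)]


def hybridize(number):
    """Return the one probe per grouping that a 4-digit number hybridizes to.

    Each probe is a 4-position pattern of classes (0 or 1), so every grouping
    has 2**4 = 16 probes and the whole array has 4 * 16 = 64 probes.
    """
    digits = [int(c) for c in number]
    return [tuple(group(d) for d in digits) for group in GROUPINGS]


def infer(signals):
    """List every 4-digit number consistent with the observed probes."""
    hits = []
    for digits in product(range(10), repeat=4):
        if all(tuple(group(d) for d in digits) == signal
               for group, signal in zip(GROUPINGS, signals)):
            hits.append("".join(map(str, digits)))
    return hits


signals = hybridize("8675")
print(len(GROUPINGS) * 2 ** 4)   # total probes in the degenerate array
print(infer(signals))            # numbers consistent with the four signals
```

No probe in this sketch reports a digit directly, yet the four class patterns together leave only one consistent number, because the four assumed groupings jointly distinguish all ten digits.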

ISBH is an indirect strategy that uses several small arrays (65,536 probes each) 
to closely approximate the information that would be gathered from a single SBH 
array containing ~4.3 billion probes (Figure 2). Our strategy relies on degenerate probe
arrays that are similar to the binary SBH chips proposed by Pevzner et al. [P. Pevzner et
al., Towards DNA Sequencing Chips, In 19th Int Conf Mathematical Foundations of
Computer Science, Lecture Notes in Computer Science, Springer-Verlag, Berlin, Vol
841, pp. 143-158 (1994)]. Unlike conventional SBH, whose accuracy drops with



METHODS 

The inventors have simulated the following laboratory experiment: 1) A
single-stranded target DNA of unknown sequence is sheared into overlapping oligomers
16 bases long. 2) These oligomers are hybridized to an ISBH degenerate probe array and
this pattern of hybridization is detected. 3) The hybridization data is reconstructed by
computer algorithm into contiguous sequence.

An ISBH probe array containing 1.64 million different degenerate 16-mers was
designed. This array is 25-fold redundant - it consists of 25 groups of 2^16 = 65,536
(64K) different probe oligomers. Figure 3 shows the identity of each degenerate probe
group in the array. Each group of 64K degenerate probes is capable of hybridizing to
all possible explicit 16-mers. Thus, a single explicit 16-mer will, under ideal
conditions, hybridize to exactly 25 different degenerate probes in the ISBH array. 
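
The relationship between an explicit 16-mer and the degenerate probes it hybridizes to can be sketched as follows. The 25 group patterns of Figure 3 are not reproduced in the text, so `GROUP_PATTERNS` below holds three hypothetical patterns built from the two-way base groupings named earlier (R/Y, M/K, W/S). The function name `degenerate_probe` is also an assumption.

```python
# Two-way base groupings named in the text: purine/pyrimidine, amino/keto,
# weak/strong. Each table maps a base to its one-letter degenerate class.
CLASSES = {
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "MK": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "WS": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def degenerate_probe(kmer, pattern):
    """Degenerate probe that an explicit k-mer hybridizes to in one probe group.

    `pattern` lists, position by position, which grouping the group uses.
    """
    return "".join(CLASSES[g][base] for g, base in zip(pattern, kmer))


# Hypothetical group patterns (assumptions; Figure 3 defines the real ones).
GROUP_PATTERNS = [
    ["RY"] * 16,
    ["WS"] * 16,
    ["RY", "WS"] * 8,
]

target = "ACGTACGTTGCAAGCT"
for pattern in GROUP_PATTERNS:
    print(degenerate_probe(target, pattern))
```

Because every position of a pattern has two classes, a group contains 2^16 probes, each probe stands for 2^16 explicit 16-mers, and a given explicit 16-mer maps to exactly one probe per group.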

Simulations of ISBH experiments and subsequent data analysis were performed 
on a Silicon Graphics Origin2000 supercomputer (Boston University Center for 
Computational Science) with code written in C as follows: 1) Each target DNA
sequence was retrieved from GenBank (http://www2.ncbi.nlm.nih.gov/genbank) and
broken up into all possible component 16-mers. This set of 16-mers was then shuffled
to destroy any positional information. 2) Each 16-mer from the target DNA was
compared against all 1.64 million degenerate probes in the ISBH array, simulating
ideal hybridization. If a 16-mer from the target was found to hybridize with one of
the degenerate probes, then the hybridization was considered to be a signal arising 
from that particular probe in the ISBH array. 3) The 16-mers from the target DNA 
were then discarded, to account for the fact that in a real experiment, the target DNA 
is of unknown sequence. 4) The signals from the ISBH array were collected to 
produce a set of degenerate 16-mers, which is the non-repeated set of degenerate 
16-mers present in the target DNA. We emphasize that the only data retained is the 
signal pattern from the ISBH array, as would be detected in an actual ISBH 
experiment. The degenerate set contains no information about the explicit base 
sequence of any 16-mer present or its position in the target DNA sequence. 
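
The four simulation steps can be sketched in a few lines. This is an illustration in Python, not the original C code, and it assumes a hypothetical two-group array (one all-R/Y group and one all-W/S group) in place of the 25 groups of Figure 3. The name `simulate_hybridization` is illustrative.

```python
import random

RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
WS = {"A": "W", "T": "W", "C": "S", "G": "S"}
GROUPS = [RY, WS]  # hypothetical two-group array, one grouping per group


def simulate_hybridization(target, k=16, seed=0):
    """Return the set of (group index, degenerate k-mer) signals for a target."""
    # Step 1: break the target into all overlapping k-mers and shuffle them
    # so that no positional information survives.
    kmers = [target[i:i + k] for i in range(len(target) - k + 1)]
    random.Random(seed).shuffle(kmers)
    # Step 2: ideal hybridization, exactly one probe per group for each k-mer.
    signals = set()
    for kmer in kmers:
        for g, group in enumerate(GROUPS):
            signals.add((g, "".join(group[b] for b in kmer)))
    # Steps 3 and 4: the explicit k-mers are discarded; only the non-repeated
    # set of degenerate signals is returned.
    return signals


print(sorted(simulate_hybridization("ACGTAC", k=4)))
```

Returning a set mirrors the statement that only the non-repeated signal pattern is retained, with no record of which explicit k-mer produced a signal or where it lay in the target.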

Sequence Reconstruction 

1) Inference. A set of explicit 16-mers is inferred from the set of degenerate 
16-mers detected by the ISBH array. The inference is accomplished by testing every 
possible explicit 16-mer against the degenerate set. Any particular explicit 16-mer is 
included in the inferred set only if exactly 25 corresponding degenerate 16-mers are 
present in the data from the ISBH array. Under conditions of ideal hybridization, the 
inferred 16-mer set is always a superset of the set of 16-mers present in the target 
DNA sequence. The inferred set usually contains false positive 16-mers, ones which 
are not actually present in the original target DNA sequence. The number of these 
false positives increases with the length and repetitiveness of the target DNA 
(Figure 4). 
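
The inference step and the origin of false positives can be shown on tetranucleotides with a two-fold redundant array, in the spirit of Figure 4. The sketch assumes one all-R/Y group and one all-W/S group. The three target tetranucleotides are hypothetical and are not the ones drawn in the figure.

```python
from itertools import product

RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
WS = {"A": "W", "T": "W", "C": "S", "G": "S"}
GROUPS = [RY, WS]  # assumed two-fold redundant array


def probes_for(kmer):
    """The one degenerate probe per group that an explicit k-mer hybridizes to."""
    return [(g, "".join(group[b] for b in kmer)) for g, group in enumerate(GROUPS)]


def infer(signals, k):
    """Keep every explicit k-mer whose probes are all present in the signals."""
    inferred = set()
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if all(p in signals for p in probes_for(kmer)):
            inferred.add(kmer)
    return inferred


# Hypothetical target tetranucleotides (not the ones shown in Figure 4).
targets = {"AAAA", "ACGT", "GGCC"}
signals = {p for t in targets for p in probes_for(t)}
inferred = infer(signals, 4)
false_positives = inferred - targets
print(len(inferred), len(false_positives))
```

With these three targets the sketch infers nine tetranucleotides, six of which are false positives: each false positive pairs the R/Y signal of one target with the W/S signal of another. The inferred set is still a superset of the targets, as stated above.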

2) Primary Data Reduction. Since the inferred set of explicit 16-mers contains 
an unknown number of false positives, a data reduction step is required to eliminate 
them. Every 16-mer in the inferred set is examined to determine if it overlaps by six
bases with at least one other 16-mer in the inferred set on both its 3' and 5' ends. Unless
both overlaps are found, the 16-mer is discarded. This procedure is repeated iteratively
on the resultant set of 16-mers, each time examining an overlap one base longer than 
was used for the previous iteration, until an overlap of fifteen bases is reached. The 
set that remains is then iterated using fifteen-base overlaps until four or fewer 16-mers
are discarded at each iteration. 
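
Primary data reduction can be sketched as below. The function names and the parameters `first_overlap` and `stop_at` are assumptions introduced so that the same code runs on toy k-mers. The pairwise comparison is written for clarity and is quadratic, which is not necessarily how the original C code was organized.

```python
def prune_once(kmers, m):
    """Keep k-mers that overlap some other k-mer by m bases on both ends."""
    kept = set()
    for x in kmers:
        has_3prime = any(y != x and x[-m:] == y[:m] for y in kmers)
        has_5prime = any(y != x and y[-m:] == x[:m] for y in kmers)
        if has_3prime and has_5prime:
            kept.add(x)
    return kept


def primary_reduction(kmers, k=16, first_overlap=6, stop_at=4):
    """Iteratively discard k-mers lacking overlaps, lengthening the overlap."""
    current = set(kmers)
    # One pass per overlap length, from first_overlap up to k - 1 bases.
    for m in range(first_overlap, k):
        current = prune_once(current, m)
    # Repeat at the longest overlap until stop_at or fewer are discarded.
    while True:
        pruned = prune_once(current, k - 1)
        discarded = len(current) - len(pruned)
        current = pruned
        if discarded <= stop_at:
            return current


toy = {"ACGT", "CGTT", "GTTG", "TTGC", "TGCA", "AAAA"}
print(sorted(prune_once(toy, 3)))
```

In this sketch the terminal k-mers of a linear fragment fail the two-sided test, so each pass also trims the fragment ends, while an isolated k-mer such as `AAAA` is removed in the first pass.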

3) Secondary Data Reduction. All possible reconstruction ambiguities are 
eliminated by comparing the 3'-fifteen bases of each 16-mer with the 3'-fifteen bases
of every other 16-mer in the set from step 2 above. If two or more 16-mers are found
to have identical fifteen base 3'-ends, then they are all discarded from the data set. A
similar procedure is used to compare the 5'-fifteen bases of the 16-mers and eliminate
any duplication. 
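
A minimal sketch of secondary data reduction follows. It assumes the 3' comparison is applied first and the 5' comparison is then applied to the survivors, which is one reading of the description. The helper names are illustrative.

```python
from collections import Counter


def drop_shared(kmers, key):
    """Discard every k-mer whose key (an end of k-1 bases) is shared with another."""
    counts = Counter(key(x) for x in kmers)
    return {x for x in kmers if counts[key(x)] == 1}


def secondary_reduction(kmers, k=16):
    """Remove all k-mers with a non-unique 3' end, then a non-unique 5' end."""
    survivors = drop_shared(set(kmers), lambda x: x[-(k - 1):])  # 3' (k-1) bases
    return drop_shared(survivors, lambda x: x[:k - 1])           # 5' (k-1) bases


toy = {"ACGT", "TCGT", "GGGA", "GGGC", "TTAA"}
print(secondary_reduction(toy, k=4))
```

After this step every remaining k-mer has at most one possible neighbor on each side, which is what makes the subsequent reassembly unambiguous.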

4) Sequence Reassembly. The 16-mers remaining in the inferred set are then 
assembled into longer sequences. This is done by comparing the 3'-fifteen bases of
each 16-mer to the 5'-fifteen bases of every other 16-mer in the data set. If a match is 
found, then the two 16-mers are combined into a single 17-mer. The two terminal 
15-mers on either side of this newly formed 17-mer are compared for overlap with the 
remaining 16-mers in the data set and the process is repeated until no more overlaps 
are found. To ensure reconstruction accuracy, only fragments at least 100 bases in
length were considered to be part of the target sequence. 
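
Sequence reassembly can be sketched as a chain extension through (k-1)-base overlaps. The sketch assumes its input has already passed secondary data reduction, so that every (k-1)-base prefix and suffix is unique. The function name and the `min_length` parameter are illustrative. Chains that close on themselves have no starting k-mer and are ignored by this simplified version.

```python
def reassemble(kmers, k=16, min_length=100):
    """Join k-mers through (k-1)-base overlaps into longer fragments."""
    kmers = set(kmers)
    by_prefix = {x[:k - 1]: x for x in kmers}   # assumes unique prefixes
    suffixes = {x[-(k - 1):] for x in kmers}
    fragments = []
    used = set()
    # Start from k-mers that no other k-mer leads into.
    starts = [x for x in sorted(kmers) if x[:k - 1] not in suffixes]
    for start in starts:
        fragment, current = start, start
        used.add(start)
        while True:
            nxt = by_prefix.get(current[-(k - 1):])
            if nxt is None or nxt in used:
                break
            fragment += nxt[-1]   # each joined k-mer extends the fragment by one base
            used.add(nxt)
            current = nxt
        if len(fragment) >= min_length:
            fragments.append(fragment)
    return fragments


target = "ACGTTGCAAT"
toy = {target[i:i + 4] for i in range(len(target) - 3)}
print(reassemble(toy, k=4, min_length=5))
```

The prefix dictionary replaces the all-against-all comparison described above with a direct lookup; the resulting fragments are the same when the uniqueness assumption holds.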

RESULTS 

Inference Algorithm 

ISBH simulations were performed on 76 target sequences obtained from 
GenBank comprising a total of 2.45 million bases, ranging from 5 to 100 kilobases in 
length. The size of the inferred 16-mer pool increases as a power law function of the 
number of different 16-mers in the target DNA. Data reduction reliably eliminates all 
but a handful of the false positives for all target lengths investigated, even when false 
positives accounted for more than 99% of the inferred 16-mer pool (Figure 5). The
set of inferred 16-mers remaining after data reduction closely approximates the
information that would be gathered from an SBH array containing all explicit 16-mers
(~4.3 billion probes).

Sequence Reconstruction

The ISBH reconstruction algorithm typically returns a target DNA as several 
non-overlapping fragments that are in the range of several hundred to several thousand
bases in length. Reconstructed fragments always show 100% identity (by BLASTN)
to some region of unknown position in the target DNA sequence. The largest single
fragment reconstructed was 28 kb, and fragments longer than 10 kb were commonly
observed (Figure 6). In most cases, more than 95% of the target DNA was recovered 
in a simulated ISBH experiment (Figure 7). ISBH performs optimally on target DNAs 
having no repeated 16-mers, generally returning a handful of long (3-15 kb) fragments. 
Even on sequences with many repeated 16-mers, ISBH returns dozens of fragments 
shorter than 5 kb, which is five to ten times the performance of electrophoretic 
sequencing methods. Target DNAs longer than 50 kb tend to produce large numbers 
of false positives in the inference step, a few of which remain after data reduction. 
False positives introduce ambiguities during reassembly, leading to lower average 
lengths for reconstructed fragments. A comprehensive list of each sequence analyzed 
as well as a summary of the ISBH simulation results are shown in Figure 8. 

DETAILED DESCRIPTION OF THE INVENTION 

While this invention is satisfied by embodiments in many different forms, there 
will herein be described preferred embodiments of the invention, with the 
understanding that the present disclosure is to be considered exemplary of the 
principles of the invention and is not intended to limit the invention to the 
embodiments illustrated and described. The scope of the invention will be measured 
by the appended claims and their equivalents. 

A benefit of the present invention, in contrast to conventional SBH, is that the 
inventor's ISBH method has the potential to sequence very long targets at high 
accuracy, using an oligonucleotide array of moderate size. The hypothetical ISBH 
array studied here could easily sequence 15-45kb of DNA in a single experiment. The 
ISBH method requires no electrophoresis, no information about the target DNA, and 
could be used for diagnostic as well as de novo applications. In a single experiment, 
ISBH could generate more sequence data than two dozen Sanger sequencing reactions
after shotgun subcloning of a target DNA. In the best cases, each fragment 
reconstructed by the ISBH method can outperform electrophoretic methods by 28-fold. 
In the worst cases, ISBH performance is equivalent to electrophoretic methods. ISBH 
reconstruction of a single target DNA generally required less than 10 minutes of 
supercomputer time to complete. The computational complexity of the inference step is 
of order N^2, while the data reduction and reassembly steps are of order N log(N).
Sequence reconstruction using a highly streamlined ISBH algorithm running on a 
typical desktop computer could be completed in a few hours. 

For DNA of random sequence, a given 16-mer should appear once every 4^16 ≈
4.3x10^9 bases, a 15-mer once in 4^15 ≈ 10^9 bases, and a 14-mer once in 4^14 ≈ 2.7x10^8
bases. We note however, that DNAs from a wide variety of organisms in the range of 
10-100 kb typically have hundreds or thousands of repeated 16-mers. As shorter 
subsequences are examined, the number of repeated subsequences increases 
dramatically. For example, the 48.5 kb genome of bacteriophage lambda, which has 
no repeated 16-mers, has a single repeated 15-mer, and ten distinct 14-mers appearing 
more than once. This would suggest that any form of de novo SBH using 
oligonucleotide probes shorter than 16 bases will perform poorly on target DNAs 
longer than a few kilobases. ISBH appears to perform optimally both in terms of 
absolute read lengths and relative target coverage on DNAs in the range of 25kb that 
have small numbers of repeated 16-mers. ISBH is a scalable technique - the number 
of false positives generated by inference increases as the number of probes used in the 
ISBH array decreases. An ISBH array smaller than the one examined here (e.g., 12 
probe groups using 12 x 2^16 = 7.86x10^5 probes) would still sequence with 100% accuracy,
but would return a target as shorter fragments. 
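
The arithmetic in this paragraph, and the notion of a repeated subsequence, can be checked with the short sketch below. The function name `repeated_kmers` is an assumption, and the toy sequence merely stands in for a real target such as the lambda genome, which is not included here.

```python
from collections import Counter


def repeated_kmers(sequence, k):
    """Number of distinct k-mers that occur more than once in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sum(1 for n in counts.values() if n > 1)


# Expected spacing, in bases, of a given k-mer in DNA of random sequence.
for k in (14, 15, 16):
    print(k, 4 ** k)

# Probe counts for the 25-group array and for the smaller 12-group array.
print(25 * 2 ** 16, 12 * 2 ** 16)

# Toy sequence: shorter subsequences repeat more often than longer ones.
toy = "ACGTACGTTACGA"
print([repeated_kmers(toy, k) for k in (3, 4, 5)])
```

Applied to a real target with k = 14, 15, and 16, `repeated_kmers` gives the kind of repeat counts quoted above for bacteriophage lambda.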

While ISBH under ideal conditions would appear to provide an enormous gain 
for de novo sequencing over conventional SBH and electrophoretic methods, several 
daunting technical obstacles remain. Each degenerate probe is actually a mixture of 
many individual probes that are bound to the same area in an SBH array. The binding 
capacity of such a degenerate probe is greatly reduced in comparison to a pure 
individual probe - for the hypothetical ISBH array in this study, the complexity of
each degenerate probe is 65,536. Such a high probe complexity may mean that 
accurate physical hybridization cannot be achieved with a high signal to noise ratio. 
Possible solutions to this problem include the use of base analogs to decrease probe 
complexity or the addition of an enzymatic step (e.g., ligation) to augment the 
accuracy of simple physical hybridization.

Noise contamination of the data set, particularly in terms of false negatives, 
must be studied in greater detail. False positives are easily dealt with in the data 
reduction step, but false negatives (target 16-mers that never appear at all in the 
inferred data set) will have the effect of lowering the mean fragment length during 
reconstruction. Aberrant hybridization also increases the complexity of data processing 
needed for reliable sequence reconstruction; the upper bound for robust sequence 
reconstruction from an actual ISBH implementation is likely to be somewhat lower 
than the ideal situation presented here. 

An ISBH sequencing approach would be very effective for rapid analysis of 
viral and bacterial genomes which are essentially non-repeating. Sequencing of 
double-stranded target DNAs by ISBH is also possible, as is sequencing of a mixture 
of targets. For double-stranded DNA, ISBH performance is equivalent to the case of a 
single-stranded DNA twice as long. If a double stranded target is cleaved by a 
restriction endonuclease before hybridization, ISBH will return the sequence of each 
restriction fragment. If this experiment is repeated using a restriction endonuclease 
with a different recognition site, then the fragments can be aligned relative to one 
another using standard contig reassembly algorithms. The potential of the ISBH 
strategy is so strong that we are now investigating strategies to implement it in 
practice. 

Another embodiment of the ISBH Probe Array, Experimental, and Data 
Analysis Algorithm Design consists of the following: 

Probe Array Design. The proposed array consists of 768K oligonucleotide 
probes divided into three groups: 1) all 256K possible 9-base single-stranded
sequences, 2) all 256K possible 9-base 5'-overhanging partial duplexes, and 3) all
256K possible 9-base 3'-overhanging partial duplexes.

Target Preparation. The target DNA must be single stranded - it may be 
prepared (for example) by long PCR of a double stranded target using one primer that 
is biotinylated at its 5' end to facilitate purification from the other strand. The 
biotinylated strand may be captured on streptavidin-coated beads or column. 

Experimental Design. Eight distinct oligonucleotides in two groups are 
required as follows. Group I: 5'-ANNNNNNN-3', 5'-CNNNNNNN-3',
5'-GNNNNNNN-3', 5'-TNNNNNNN-3'. Group II: 5'-P-NNNNNNNA-Sdd-3',
5'-P-NNNNNNNC-Sdd-3', 5'-P-NNNNNNNG-Sdd-3', and 5'-P-NNNNNNNT-Sdd-3'.
P denotes a phosphate group, Sdd denotes a 3'-dideoxy base connected by a 
phosphorothioate linkage. Sixteen separate reactions are then performed: each 
oligonucleotide from group I is combined with an oligonucleotide from group II and 
then hybridized to the single-stranded target under conditions favoring accurate
base-pairing. After hybridization has occurred, the oligonucleotides still remaining in
solution must be removed (for example, by size-exclusion column chromatography).
DNA ligase (and all necessary cofactors) is then added to the reaction mixture. After the
ligation reaction, exonuclease III is added to the reaction mixture to destroy any 
unligated oligonucleotides that are still hybridized to the target. 

Under ideal conditions, the ligation products will be 16-mers of the form: 
5'-ANNNNNNNNNNNNNNA-Sdd-3' (and all 15 other permutations of the end
bases). The ligation products from each of the sixteen reactions are then hybridized 
separately to a probe array as described above. 

Data Analysis. Each of the sixteen probe array hybridization experiments 
described above generates the following data: a set of 9-mers from probe group 1, a 
set of 9-mers from probe group 2, and a set of 9-mers from group 3. The following 
algorithm is used to expand the data: compare each 9-mer from group 2 to each
9-mer from group 3. If the 9-mer from group 2 has the same 3' two bases as the 5'
two bases of the 9-mer in group 3 (i.e., they have a two-base overlap), then combine
the two 9-mers to form a single 16-mer (concatenate the 3' seven bases of the group 3
oligo to the 3' end of the group 2 oligo). This newly formed 16-mer is retained only
if all eight of its 9-base subsequences are present in group 1. This analysis is
performed for all probe array experiments and all retained 16-mers are collected into a
single set. This set of 16-mers (which is the inferred set of explicit 16-mers from the
target) is then subjected to the same data reduction and sequence reconstruction
algorithms that we have previously described for ISBH.
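
The data expansion for this embodiment can be sketched as follows. The function name `expand_to_16mers` and the toy sequences are assumptions made for illustration. The three arguments stand for the sets of 9-mers read from probe groups 1, 2, and 3 in one hybridization experiment.

```python
def expand_to_16mers(group1, group2, group3):
    """Join group 2 and group 3 9-mers and filter the results against group 1."""
    group1 = set(group1)
    inferred = set()
    for left in group2:
        for right in group3:
            # Two-base overlap: the 3' two bases of the group 2 9-mer must
            # equal the 5' two bases of the group 3 9-mer.
            if left[-2:] != right[:2]:
                continue
            candidate = left + right[2:]   # 9 + 7 = 16 bases
            # Retain only if all eight 9-base subsequences are in group 1.
            if all(candidate[i:i + 9] in group1 for i in range(8)):
                inferred.add(candidate)
    return inferred


# Toy data: one true 16-mer plus a decoy group 3 9-mer with the same overlap.
true_16mer = "ACGTTGCAAGCTTAGC"
group1 = {true_16mer[i:i + 9] for i in range(8)}
group2 = {true_16mer[:9]}
group3 = {true_16mer[7:], "AATTTTTTT"}
print(expand_to_16mers(group1, group2, group3))
```

The decoy joins with the group 2 9-mer through the same two-base overlap, but the resulting 16-mer is rejected because its internal 9-mers are absent from group 1.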

Accordingly, this invention is not limited to the particular embodiments
disclosed, but is intended to cover all modifications that are within the spirit and scope
of the invention as defined by the appended claims.

CLAIMS

What is claimed is: 

1. A method of testing a nucleic acid target, comprising: 

a) fragmenting a single-stranded target DNA into single-stranded 
target DNA fragments; 

b) testing said fragments so as to generate a signal for each
fragment;

c) calculating a first set of N-mers from said signals, each of said
N-mers having a sequence with 3' and 5' ends;

d) comparing a portion of the nucleic acid sequence of each of said 
N-mers of said first set with a portion of the nucleic acid sequence of every 
other N-mer in said first set for sequence overlap; and 

e) eliminating each N-mer that is found not to display said overlap, 
so as to create a second set of N-mers. 

2. The method of Claim 1, further comprising the steps: 

f) comparing a portion of the nucleic acid sequence of each of said 
N-mers in said second set with a portion of the nucleic acid sequence of every 
other N-mer in said second set for sequence overlap, wherein the portion 
compared has a length in bases defined by N-1; and

g) identifying an instance where said portion of one N-mer from 
said second set is found to overlap with said portion of another N-mer from 
said second set, thereby identifying first and second overlapping N-mers. 

3. The method of Claim 1, wherein non-target DNA fragments are added 
to said single-stranded target DNA fragments prior to step (b). 

WO 99/22025, PCT/US98/22519

4. The method of Claim 2, wherein N is 16 and said portion compared in 
step (f) is fifteen bases in length. 

5. The method of Claim 4, wherein said fifteen bases compared in step (f) 
are the 3'-fifteen bases of each 16-mer with the 5'-fifteen bases of every other 16-mer
in said second set. 

6. The method of Claim 5, further comprising the step (h) constructing a 
17-mer from the combination of the sequences of said first and second overlapping 
N-mers of step (g). 

7. A method of testing a nucleic acid target, comprising: 

a) fragmenting a single-stranded target DNA into single-stranded
fragments;

b) testing each of said fragments so as to generate a signal for each
fragment;

c) calculating a first set of N-mers from said signals, each of said
N-mers having a sequence with 3' and 5' ends;

d) comparing a portion of the nucleic acid sequence of each of said 
N-mers of said first set with a portion of the nucleic acid sequence of every 
other N-mer in said first set for sequence overlap; 

e) eliminating each N-mer that is found not to display said overlap, 
so as to create a second set of N-mers; 

f) comparing a portion of the nucleic acid sequence of each of said 
N-mers in said second set with a portion of the nucleic acid sequence of every 
other N-mer in said second set for sequence overlap, wherein the portion 
compared has a length in bases defined by N-1; and

g) identifying an instance where said portion of one N-mer from 
said second set is found to overlap with said portion of another N-mer from 
said second set, thereby identifying first and second overlapping N-mers. 

8. The method of Claim 7, wherein non-target DNA fragments are added
to said single-stranded target DNA fragments prior to step (b). 

9. The method of Claim 8, wherein N is 16 and said portion compared in 
step (f) is fifteen bases in length.

10. The method of Claim 9, wherein said fifteen bases compared in step (f) 
are the 3'-fifteen bases of each 16-mer with the 5'-fifteen bases of every other 16-mer 
in said second set. 

11. The method of Claim 10, further comprising the step (h) constructing a 
17-mer from the combination of the sequences of said first and second overlapping 
N-mers of step (g). 

12. A method of testing a nucleic acid target, comprising: 

a) fragmenting a single-stranded target DNA into single-stranded
fragments;

b) adding non-target DNA fragments to said single-stranded target
DNA fragments to create a mixture;

c) testing each of said fragments so as to generate a signal for each
fragment;

d) calculating a first set of N-mers from said signals, each of said 
N-mers having a sequence with 3' and 5' ends; 

e) comparing a portion of the nucleic acid sequence of each of said 
N-mers of said first set with a portion of the nucleic acid sequence of every 
other N-mer in said first set for sequence overlap; 

f) eliminating each N-mer that is found not to display said overlap, 
so as to create a second set of N-mers; 

g) comparing a portion of the nucleic acid sequence of each of said
N-mers in said second set with a portion of the nucleic acid sequence of every 
other N-mer in said second set for sequence overlap, wherein the portion 
compared has a length in bases defined by N-1; and

h) identifying an instance where said portion of one N-mer from 
said second set is found to overlap with said portion of another N-mer from 
said second set, thereby identifying first and second overlapping N-mers. 

13. The method of Claim 12, wherein N is 16 and said portion compared in 
step (g) is fifteen bases in length. 

14. The method of Claim 13, wherein said fifteen bases compared in step
(g) are the 3'-fifteen bases of each 16-mer with the 5'-fifteen bases of every other
16-mer in said second set.

15. The method of Claim 14, further comprising the step (i) constructing a
17-mer from the combination of the sequences of said first and second overlapping
N-mers of step (h).